Introduction to Natural Language Processing (NLP)

Generally speaking, Computational Text Analysis is a set of interpretive methods that seek to understand patterns in human discourse, in part through statistics. More familiar methods, such as close reading, are exceptionally well suited to the analysis of individual texts; however, our research questions typically compel us to look for relationships across texts, sometimes numbering in the thousands or even millions. We have to zoom out in order to perform so-called distant reading. Fortunately for us, computers are well suited to identifying the kinds of textual relationships that exist at scale.

We will spend the week exploring research questions that computational methods can help to answer and thinking about how these complement -- rather than displace -- other interpretive methods. Before moving to that conceptual level, however, we will familiarize ourselves with the basic tools of the trade.

Natural Language Processing is an umbrella term for the methods by which a computer handles human language text. This includes transforming the text into a numerical form that the computer manipulates natively, as well as the measurements that researchers often perform. In this parlance, a natural language is one spoken by humans, as opposed to a formal language, such as Python, which comprises a set of logical operations.

The goal of this lesson is to jump right into text analysis and natural language processing. Rather than starting with the nitty-gritty of programming in Python, this lesson will demonstrate some neat things you can do with a minimal amount of coding. Today, we aim to build intuition about how computers read human text and to learn some of the basic operations we'll perform on texts.

Lesson Outline

  • Jargon
  • Text in Python
  • Tokenization & Term Frequency
  • Pre-Processing:
    • Changing words to lowercase
    • Removing stop words
    • Removing punctuation
  • Part-of-Speech Tagging
    • Tagging tokens
    • Counting tagged tokens
  • Demonstration: Guess the Novel!
  • Concordance

0. Key Jargon

General

  • programming (or coding)
    • A program is a sequence of instructions given to the computer, in order to perform a specific task. Those instructions are written in a specific programming language, in our case, Python. Writing these instructions can be an art as much as a science.
  • Python
    • A general-use programming language that is popular for NLP and statistics.
  • script
    • A block of executable code.
  • Jupyter Notebook
    • Jupyter is a popular interface in which Python scripts can be written and executed. Stand-alone scripts are saved as Notebooks. A script can be subdivided into units called cells, which can be executed individually. Cells can also contain discursive text and HTML formatting (such as this cell!).
  • package (or module)
    • Python offers a basic set of functions that can be used off-the-shelf. However, we often wish to go beyond the basics. To that end, packages are collections of Python files that contain pre-made functions. These functions are made available to our program when we import the package that contains them (see the short example at the end of this list).
  • Anaconda
    • Anaconda is a platform for programming in Python. A platform constitutes a closed environment on your computer that has been standardized for functionality. For example, Anaconda contains common packages and programming interfaces for Python, and its developers ensure compatibility among the moving parts.
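
Since packages come up throughout this lesson, here is a minimal sketch of importing one and calling one of its pre-made functions. The math package is chosen purely for illustration; it is not used elsewhere in this lesson.

In [ ]:
# Import the 'math' package from Python's standard library
# (chosen here only to illustrate the import mechanism)

import math

# Call one of its pre-made functions
math.sqrt(16)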

When Programming

  • variable
    • A variable is a generic container that stores a value, such as a number or a series of letters. This is not like a variable from high-school algebra, which has a single "correct" value to solve for. Rather, the user assigns values to the variable in order to perform operations on it later (see the short example after this list).
  • string
    • A type of object consisting of a single sequence of alpha-numeric characters. In Python, a string is indicated by quotation marks around the sequence.
  • list
    • A type of object that consists of a sequence of elements.
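
To make these terms concrete, here is a small illustrative sketch. The variable names greeting and word_list are made up for this example.

In [ ]:
# A small illustration (the variable names are made up for this example)

# A string: a sequence of characters inside quotation marks, stored in a variable
greeting = "Hello, text analysis!"

# A list: a sequence of elements inside square brackets
word_list = ["Hello", "text", "analysis"]

print(greeting)
print(word_list)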

Natural Language Processing

  • pre-processing
    • Transforming a human language text into a computer-manipulable format. A typical pre-processing workflow includes setting the text in lower case and removing stop words and punctuation, after which term frequencies can be counted.
  • token
    • An individual unit within a text, roughly corresponding to a word or punctuation mark.
  • stop words
    • The function words in a natural language, such as the, of, it, etc. These are typically the most common words.
  • term frequency
    • The number of times a term appears in a given text. This is either reported as a raw tally or normalized by dividing by the total number of words in the text (see the short sketch after this list).
  • POS tagging
    • One common task in NLP is the determination of a word's part-of-speech (POS). The label that describes a word's POS is called its tag. Specialized functions that make these determinations are called POS Taggers.
  • concordance
    • Index of instances of a given word (or other linguistic feature) in a text. Typically, each instance is presented within a contextual window for human readability.
  • NLTK (Natural Language Toolkit)
    • A common Python package that contains many NLP-related functions.
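
Before moving on, here is a small sketch of the difference between a raw tally and a normalized term frequency. The toy token list and variable names are invented for illustration.

In [ ]:
# A toy illustration of raw vs. normalized term frequency
# (the token list and variable names are invented for this example)

import collections

toy_tokens = ["digital", "media", "digital", "humanistic"]

# Raw tally: how many times each term appears
raw_counts = collections.Counter(toy_tokens)
print(raw_counts)

# Normalized: divide each tally by the total number of tokens in the text
normalized = {term: count / len(toy_tokens) for term, count in raw_counts.items()}
print(normalized)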

Further Resources:

Check out the full range of techniques included in Python's nltk package here: http://www.nltk.org/book/

1. Text in Python

First, a quote about what digital humanities means, from digital humanist Kathleen Fitzpatrick. Source: "On Scholarly Communication and the Digital Humanities: An Interview with Kathleen Fitzpatrick", In the Library with the Lead Pipe


In [ ]:
print("For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media.")

In [ ]:
# Assign the quote to a variable, so we can refer back to it later
# We get to make up the name of our variable, so let's give it a descriptive label: "sentence"

sentence = "For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media."

In [ ]:
# Oh, also: anything on a line starting with a hashtag is called a comment,
# and is meant to clarify code for human readers. The computer ignores these lines.

In [ ]:
# Print the contents of the variable 'sentence'

print(sentence)

2. Tokenizing Text and Counting Words

The above output is how a human would read that sentence. Next we look at the main way in which a computer "reads", or parses, that sentence.

The first step is typically to tokenize it, or to change it into a series of tokens. Each token roughly corresponds to either a word or a punctuation mark. These smaller units are more straightforward for the computer to handle for tasks like counting.


In [ ]:
# Import the NLTK (Natural Language Toolkit) package
# Note: if NLTK later reports a missing resource when tokenizing, you may need to run
# nltk.download('punkt') once (or nltk.download('punkt_tab') in newer versions of NLTK)

import nltk

In [ ]:
# Tokenize our sentence!

nltk.word_tokenize(sentence)

In [ ]:
# Create new variable that contains our tokenized sentence

sentence_tokens = nltk.word_tokenize(sentence)

In [ ]:
# Inspect our new variable
# Note the square brackets at the beginning and end, which indicate we are looking at a list-type object

print(sentence_tokens)

Note on Tokenization

While seemingly simple, tokenization is a non-trivial task.

For example, notice how the tokenizer has handled contractions: a contracted word is divided into two separate tokens! What do you think is the motivation for this? How else might you tokenize them?

Also notice that each token is either a word or a punctuation mark. In practice, it is sometimes useful to remove punctuation marks and at other times to keep them, depending on the task at hand.

In the coming days, we will see other tokenizers and have opportunities to explore their reasoning. For now, we will look at a few examples of NLP tasks that tokenization enables.
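
To make the point concrete, here is a small comparison of NLTK's tokenizer with a naive split on whitespace, using a clause from our example sentence. This cell is illustrative only and is not needed for the rest of the lesson.

In [ ]:
# Compare NLTK's tokenizer with a naive split on whitespace (illustrative only)
# (nltk was already imported above)

example = "It's also bringing humanistic modes of inquiry to bear on digital media."

# NLTK splits the contraction into "It" + "'s" and treats the final period as its own token
print(nltk.word_tokenize(example))

# A plain split keeps "It's" whole and leaves the period attached to "media."
print(example.split())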


In [ ]:
# How many tokens are in our list?

len(sentence_tokens)

In [ ]:
# How often does each token appear in our list?

import collections

collections.Counter(sentence_tokens)

In [ ]:
# Assign those token counts to a variable

token_frequency = collections.Counter(sentence_tokens)

In [ ]:
# Get an ordered list of the most frequent tokens

token_frequency.most_common(10)

Note on Term Frequency

Some of the most frequent words appear to summarize the sentence: in particular, the words "humanistic", "digital", and "media". However, most of these terms simply add noise to the summary: "the", "it", "to", ".", etc.

There are many strategies for identifying the most important words in a text, and we will cover the most popular ones over the next week. Today, we will look at two of them. In the first, we will simply remove the noisy tokens. In the second, we will identify important words using their parts of speech.

3. Pre-Processing: Lower Case, Remove Stop Words and Punctuation

Typically, a text goes through a number of pre-processing steps before the actual analysis begins. We have already seen the tokenization step. Pre-processing also commonly includes transforming tokens to lower case and removing stop words and punctuation marks.

Again, pre-processing is a non-trivial process that can have large impacts on the analysis that follows. For instance, what will be the most common token in our example sentence, once we set all tokens to lower case?

Lower Case


In [ ]:
# Let's revisit our original sentence

sentence

In [ ]:
# And now transform it to lower case, all at once

sentence.lower()

In [ ]:
# Okay, let's set our list of tokens to lower case, one at a time

# The syntax of the line below is tricky. Don't worry about it for now.
# We'll spend plenty of time on it tomorrow!

lower_case_tokens = [ word.lower()  for word in sentence_tokens ]

In [ ]:
# Inspect

print(lower_case_tokens)

Stop Words


In [ ]:
# Import the stopwords list
# Note: if NLTK reports that the stopwords corpus is missing, you may need to run
# nltk.download('stopwords') once

from nltk.corpus import stopwords

In [ ]:
# Take a look at what stop words are included

print(stopwords.words('english'))

In [ ]:
# Try another language

print(stopwords.words('spanish'))

In [ ]:
# Create a new variable that contains the sentence tokens but NOT the stopwords

tokens_nostops = [ word  for word in lower_case_tokens  if word not in stopwords.words('english') ]

In [ ]:
# Inspect

print(tokens_nostops)

Punctuation


In [ ]:
# Import a list of punctuation marks

import string

In [ ]:
# Inspect

string.punctuation

In [ ]:
# Remove punctuation marks from token list

tokens_clean = [word for word in tokens_nostops if word not in string.punctuation]

In [ ]:
# See what's left

print(tokens_clean)

Re-count the Most Frequent Words


In [ ]:
# Count the new token list

word_frequency_clean = collections.Counter(tokens_clean)

In [ ]:
# Most common words

word_frequency_clean.most_common(10)

Better! The ten most frequent words now give us a pretty good sense of the substance of this sentence. But we still have problems. For example, the token "'s" sneaked in there. One option is to keep adding stop words to our list, but this could go on forever and does not scale well when processing lots of text.
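
For the curious, a quick sketch of that hand-patching approach might look like the cell below. The extra stop words added here are just examples, and the point stands that this does not scale.

In [ ]:
# One possible (but unscalable) fix: extend the stop-word list by hand
# The extra tokens added here are just examples

custom_stops = stopwords.words('english') + ["'s", "n't"]

tokens_custom = [word for word in tokens_clean if word not in custom_stops]
collections.Counter(tokens_custom).most_common(10)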

There's another way of identifying content words, and it involves identifying the part of speech of each word.

4. Part-of-Speech Tagging

You may have noticed that stop words are typically short function words, like conjunctions and prepositions. Intuitively, if we could identify the part of speech of each word, we would have another way of identifying which words contribute to the text's subject matter. NLTK can do that too!

NLTK has a POS Tagger, which identifies and labels the part-of-speech (POS) for every token in a text. The particular labels that NLTK uses come from the Penn Treebank corpus, a major resource from corpus linguistics.

You can find a list of all Penn POS tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Note that, from this point on, the code is going to get a little more complex. Don't worry about the particularities of each line. For now, we will focus on the NLP tasks themselves and the textual patterns they identify.


In [ ]:
# Let's revisit our original list of tokens

print(sentence_tokens)

In [ ]:
# Use the NLTK POS tagger

nltk.pos_tag(sentence_tokens)

In [ ]:
# Assign POS-tagged list to a variable

tagged_tokens = nltk.pos_tag(sentence_tokens)

Most Frequent POS Tags


In [ ]:
# We'll tread lightly here, and just say that we're counting POS tags

tag_frequency = collections.Counter( [ tag for (word, tag) in tagged_tokens ])

In [ ]:
# POS Tags sorted by frequency

tag_frequency.most_common()

Now it's getting interesting

The "IN" tag refers to prepositions, so it's no surprise that it should be the most common. However, we can see at a glance now that the sentence contains a lot of adjectives, "JJ". This feels like it tells us something about the rhetorical style or structure of the sentence: certain qualifiers seem to be important to the meaning of the sentence.

Let's dig in to see what those adjectives are.


In [ ]:
# Let's filter our list, so it only keeps adjectives

adjectives = [word for word,pos in tagged_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']

In [ ]:
# Inspect

print( adjectives )

In [ ]:
# Tally the frequency of each adjective

adj_frequency = collections.Counter(adjectives)

In [ ]:
# Most frequent adjectives

adj_frequency.most_common(5)

In [ ]:
# Let's do the same for nouns.

nouns = [word for word,pos in tagged_tokens if pos=='NN' or pos=='NNS']

In [ ]:
# Inspect

print(nouns)

In [ ]:
# Tally the frequency of the nouns

noun_frequency = collections.Counter(nouns)

In [ ]:
# Most Frequent Nouns

print(noun_frequency.most_common(5))

And now verbs.


In [ ]:
# And we'll do the verbs in one fell swoop

verbs = [word for word,pos in tagged_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
verb_frequency = collections.Counter(verbs)
print(verb_frequency.most_common(5))

In [ ]:
# If we bring all of this together we get a pretty good summary of the sentence

print(adj_frequency.most_common(3))
print(noun_frequency.most_common(3))
print(verb_frequency.most_common(3))

5. Demonstration: Guess the Novel

To illustrate this process on a slightly larger scale, we will do exactly what we did above, but on two unidentified novels. Your challenge: guess each novel from its most frequent words.

We will do this in one chunk of code, so another challenge for you, during breaks or over the next few weeks, is to see how much of the following code you can follow (or, in computer science terms, how much of the code you can parse). If the answer is none, not to worry! Tomorrow we will take a step back and work on the nitty-gritty of programming.


In [ ]:
# Read the two text files from your hard drive
# Assign first mystery text to variable 'text1' and second to 'text2'

text1 = open('text1.txt').read()
text2 = open('text2.txt').read()

In [ ]:
# Tokenize both texts

text1_tokens = nltk.word_tokenize(text1)
text2_tokens = nltk.word_tokenize(text2)

In [ ]:
# Set to lower case

text1_tokens_lc = [word.lower() for word in text1_tokens]
text2_tokens_lc = [word.lower() for word in text2_tokens]

In [ ]:
# Remove stopwords

text1_tokens_nostops = [word for word in text1_tokens_lc if word not in stopwords.words('english')]
text2_tokens_nostops = [word for word in text2_tokens_lc if word not in stopwords.words('english')]

In [ ]:
# Remove punctuation using the list of punctuation marks from the string package

text1_tokens_clean = [word for word in text1_tokens_nostops if word not in string.punctuation]
text2_tokens_clean = [word for word in text2_tokens_nostops if word not in string.punctuation]

In [ ]:
# Frequency distribution

text1_word_frequency = collections.Counter(text1_tokens_clean)
text2_word_frequency = collections.Counter(text2_tokens_clean)

In [ ]:
# Guess the novel!

text1_word_frequency.most_common(20)

In [ ]:
# Guess the novel!

text2_word_frequency.most_common(20)

Computational Text Analysis is not simply the processing of texts through computers, but involves reflection on the part of human interpreters. How were you able to tell what each novel was? Do you notice any differences between each novel's list of frequent words?

The patterns that we notice in our computational model often enrich and extend our research questions -- sometimes in surprising ways! What next steps would you take to investigate these novels?

6. Concordances and Similar Words using NLTK

Tallying word frequencies gives us a bird's-eye view of our text, but we lose one important aspect: context. As the linguist J. R. Firth's dictum goes: "You shall know a word by the company it keeps."

Concordances show us every occurrence of a given word in a text, inside a window of the context words that appear before and after it. This is helpful for close reading: we can get at a word's meaning by seeing how it is used. We can also use the logic of shared context to identify words with similar meanings. To illustrate this, we will compare the way the word "monstrous" is used in our two novels.

Concordance


In [ ]:
# Transform our raw token lists into NLTK Text objects
text1_nltk = nltk.Text(text1_tokens)
text2_nltk = nltk.Text(text2_tokens)

In [ ]:
# Really they're no different from the raw token lists, but they come with additional useful functions
print(text1_nltk)
print(text2_nltk)

In [ ]:
# Like a concordancer!

text1_nltk.concordance("monstrous")

In [ ]:
text2_nltk.concordance("monstrous")

Contextual Similarity


In [ ]:
# Get words that appear in a similar context to "monstrous"

text1_nltk.similar("monstrous")

In [ ]:
text2_nltk.similar("monstrous")

Closing Reflection

The methods we have looked at today are the bread and butter of NLP. Before moving on, take a moment to reflect on the model of textuality they rely on. Human language texts are split into tokens. Most often, these are transformed into simple tallies: "whale" appears 1083 times; "dashwood" appears 249 times. This does not resemble human reading at all! Yet in spite of that, such a list of frequent terms makes for a useful summary of the text.

A few questions in closing:

  • Can we imagine other ways of representing the text to the computer?
  • Why do you think term frequencies are uncannily descriptive?
  • What is lost from the text when we rely on frequency information alone?
    • Can context similarity recover some of what was lost?
  • What kinds of research questions can be answered using these techniques?
    • What kinds can't?